Splice-Junction Gene Sequences

This HyperNext Creator project shows how to implement a neural network that recognises three categories of DNA sequence: exon/intron borders, intron/exon borders, and sequences matching neither of these canonical patterns.


Data set

The dataset is symbolic and must be converted to a form the neural network can accept.

The first two lines of a typical dataset file are:

EIX
GCGGGGTCGCTAAGGCCTCAGGAGGAGAAATGGCTCTCTGCAACCAGTTCTCTGCATCACE

The first line, the file header, lists the three output classes:
   E - exon/intron border,     I - intron/exon border,      X - neither.

The second line consists of 60 symbols from the set A, C, G, T, D, N, S, R, followed by a final symbol, E, I or X, giving the output classification for this sequence.
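
The file layout described above can be read with a few lines of code. The following is a minimal Python sketch, not the project's HyperNext code; the names parse_dataset, VALID_SYMBOLS and SEQUENCE_LENGTH are hypothetical and chosen only for illustration. It assumes every line after the header holds exactly 60 input symbols followed by one class symbol.

```python
VALID_SYMBOLS = set("ACGTDNSR")  # the eight input symbols named above
SEQUENCE_LENGTH = 60             # input symbols per sequence line

def parse_dataset(text):
    """Return (classes, records); each record is a (sequence, label) pair."""
    # Blank lines are skipped (an assumption, not stated in the text above).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    classes = lines[0]  # header line, e.g. "EIX"
    records = []
    for number, line in enumerate(lines[1:], start=2):
        sequence, label = line[:-1], line[-1]
        if len(sequence) != SEQUENCE_LENGTH:
            raise ValueError(f"line {number}: expected {SEQUENCE_LENGTH} symbols")
        if not set(sequence) <= VALID_SYMBOLS:
            raise ValueError(f"line {number}: unexpected input symbol")
        if label not in classes:
            raise ValueError(f"line {number}: unknown class {label!r}")
        records.append((sequence, label))
    return classes, records

sample = "EIX\nGCGGGGTCGCTAAGGCCTCAGGAGGAGAAATGGCTCTCTGCAACCAGTTCTCTGCATCACE\n"
classes, records = parse_dataset(sample)
print(classes, records[0][1], len(records[0][0]))
```

Validating the length and symbol set while reading catches a malformed line early, before it reaches the network as a wrong-sized input vector.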

In addition to the standard DNA symbols A, C, G and T, the symbols D, N, S and R represent don't cares, as defined below:

    D =  A or G or T
    N =  A or C or G or T
    S =  C or G
    R =  A or G

Within the project this mapping is coded in the MakeMapInputs procedure, defined in the MAINCODE section.


Coding

Each of the 60 input symbols is coded as 4 values, giving 240 inputs to the neural network.

   A = 1 0 0 0
   C = 0 1 0 0
   G = 0 0 1 0
   T = 0 0 0 1

The don't cares are coded probabilistically over A, C, G and T:

   D = 0.33 0 0.33 0.33
   N = 0.25 0.25 0.25 0.25
   S = 0 0.5 0.5 0
   R = 0.5 0 0.5 0
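
The input coding can be derived directly from the don't-care definitions for D, N, S and R: each symbol spreads a total weight of 1 evenly over the bases it may stand for. The Python sketch below illustrates this idea; it is not the project's MakeMapInputs procedure, and the names BASES_FOR, symbol_code and encode_sequence are hypothetical. It uses the exact value 1/3 where the table rounds to 0.33.

```python
BASE_ORDER = "ACGT"  # position of each base within a 4-value code

# Bases each symbol may stand for (D, N, S, R are the don't cares).
BASES_FOR = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "D": "AGT", "N": "ACGT", "S": "CG", "R": "AG",
}

def symbol_code(symbol):
    """Return the 4 input values for one symbol."""
    bases = BASES_FOR[symbol]
    weight = 1.0 / len(bases)  # equal share for each possible base
    return [weight if base in bases else 0.0 for base in BASE_ORDER]

def encode_sequence(sequence):
    """Flatten a symbol sequence into 4 input values per symbol."""
    inputs = []
    for symbol in sequence:
        inputs.extend(symbol_code(symbol))
    return inputs

print(symbol_code("G"))                   # [0.0, 0.0, 1.0, 0.0]
print(symbol_code("S"))                   # [0.0, 0.5, 0.5, 0.0]
print(len(encode_sequence("ACGT" * 15)))  # 240
```

Building the codes from a table of allowed bases keeps the mapping in one place, so a further don't-care symbol would need only one new table entry.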

The three output classes are coded onto 2 network outputs as follows:

   E = 1 0
   I = 0 1
   X = 0 0
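
Because a trained network produces real-valued outputs rather than exact 0s and 1s, some rule is needed to turn the two outputs back into a class. The Python sketch below shows one possible rule; the 0.5 threshold and the names TARGETS and decode_output are assumptions made for illustration, not the behaviour of the BP1 plugin.

```python
# Target values for the two network outputs, as in the table above.
TARGETS = {"E": (1, 0), "I": (0, 1), "X": (0, 0)}

def decode_output(out_e, out_i, threshold=0.5):
    """Map two real-valued outputs to E, I or X.

    The threshold is an assumed cut-off: if neither output reaches it,
    the sequence is classed as X (neither border type).
    """
    if out_e >= threshold and out_e >= out_i:
        return "E"
    if out_i >= threshold:
        return "I"
    return "X"

print(decode_output(0.9, 0.1))  # E
print(decode_output(0.2, 0.8))  # I
print(decode_output(0.1, 0.2))  # X
```

With this coding, X needs no output of its own: it is simply the case where both border outputs stay low.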


Training and Testing

Three datasets are provided, each shuffled, and named as follows:

   DNA tiny - 10 sequences

   DNA small - 200 sequences

   DNA large - 3190 sequences

 When training the neural network, it is recommended that first-time users experiment with the DNA tiny dataset, as the larger ones can be quite time consuming.

 The setup screen allows a dataset file to be loaded, shuffled and then divided into training and testing sections.
For instance, a 40% value indicates that 40% of the loaded file will be used for training and the remaining 60% used for testing. The neural network can also be tested on the training data.
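
The shuffle-and-divide step can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: the name split_dataset is hypothetical, and rounding the training count to the nearest whole record is an assumption about how a percentage that does not divide evenly would be handled.

```python
import random

def split_dataset(records, train_percent, seed=None):
    """Shuffle records, then return (training, testing) lists."""
    rng = random.Random(seed)   # a seed makes the shuffle repeatable
    shuffled = list(records)    # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    # Assumed rule: round to the nearest whole number of records.
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

# With the DNA small dataset size and a 40% value:
training, testing = split_dataset(range(200), 40, seed=1)
print(len(training), len(testing))  # 80 120
```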


Project itself

The project can be freely modified and shows various aspects of using HyperNext Creator and the BP1 neural network plugin. The code is not optimised and could be greatly improved, especially to make the mapping more flexible and expandable.

Note
  The project as set up can cope with dataset files that have either Macintosh or UNIX line endings.

  When training the neural network, the Escape key can be used to abort the training, but on very slow machines there can be a substantial delay between pressing it and the training aborting.
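
  For readers porting the data handling to another language, the line-ending tolerance mentioned in the first note could be reproduced along these lines. This Python sketch is only an illustration; the name split_lines is hypothetical, and it assumes the file is plain ASCII with Macintosh endings being a carriage return and UNIX endings a line feed.

```python
def split_lines(raw_bytes):
    """Split file contents into lines, whatever the line-ending style."""
    text = raw_bytes.decode("ascii")
    # str.splitlines treats "\r", "\n" and "\r\n" alike as line breaks.
    return [line for line in text.splitlines() if line]

print(split_lines(b"EIX\rACGT\r"))  # Macintosh endings
print(split_lines(b"EIX\nACGT\n"))  # UNIX endings
```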



